xarray zarr
Reading
Instead of waiting for full datasets, process data incrementally as it arrives (e.g., from simulations).
code:python
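import xarray as xr
# Open lazily, one time step per Dask chunk, then walk the time axis
ds = xr.open_zarr("simulation.zarr", chunks={"time": 1})
for i in range(ds.sizes["time"]):
    step = ds.isel(time=i).load() # reads only this time step
    process(step) # placeholder, e.g., accumulate statistics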
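As a minimal sketch of what the process placeholder above might accumulate, the class below keeps a running mean and variance per grid point with Welford's update, so no earlier time step has to stay in memory. The class name RunningStats and the flat-list input are assumptions made for illustration; with xarray you would more likely pass NumPy arrays.

```python
class RunningStats:
    """Streaming mean/variance per grid point using Welford's update."""

    def __init__(self, n_points):
        self.count = 0
        self.mean = [0.0] * n_points
        self.m2 = [0.0] * n_points  # running sum of squared deviations

    def update(self, field):
        """Fold one time step (a flat sequence of n_points values) into the stats."""
        self.count += 1
        for i, value in enumerate(field):
            delta = value - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (value - self.mean[i])

    def variance(self):
        """Population variance per grid point (divides by count)."""
        return [m2 / self.count for m2 in self.m2]


# Hypothetical stream: three time steps on a two-point grid
stats = RunningStats(n_points=2)
for step in ([1.0, 10.0], [2.0, 10.0], [3.0, 10.0]):
    stats.update(step)

print(stats.mean)
print(stats.variance())
```

The same update works when time steps arrive days apart, because the only state carried between steps is count, mean, and m2.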
Treat multi-run, multi-model archives as a tree: DataTree + Zarr v3
If you work with ensembles (different model configurations, perturbations, parameter sweeps), representing them as a hierarchy can be cleaner than forcing everything into one Dataset. Recent xarray releases expose DataTree as a first-class object (xr.DataTree), and their release notes mention reading Zarr v3 stores into a DataTree.
code:python
import xarray as xr
# Example idea: keep each experiment under a node (pseudo-layout)
# /control, /perturbed, /highres ... each a dataset
dt = xr.open_datatree("experiments.zarr", engine="zarr") # layout-dependent
# Compare two runs cleanly
delta_ssh = dt["highres"].ds["zos"].mean("time") - dt["control"].ds["zos"].mean("time")
Why it matters in oceanography: it keeps metadata + provenance for runs together while still letting you compute “differences of products” naturally.
IO parallelism tuning beyond chunking (file system + scheduler interplay)
Two underused levers:
File-level parallelism: whether many small Zarr chunks or fewer large ones work better depends on whether the store sits on object storage or a POSIX file system.
Scheduler choice: threads vs. processes vs. distributed.
Pattern:
code:python
import dask
import xarray as xr
dask.config.set(scheduler="threads") # good for IO-bound workloads
ds = xr.open_zarr("ocean.zarr", chunks={}) # Dask arrays that follow the store's on-disk chunks
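Rules of thumb (benchmark on your own storage before committing):
object storage → more, smaller chunks (parallel GETs), as long as each chunk stays large enough to amortize per-request latency
HPC filesystem → fewer, larger chunks (reduce metadata ops)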
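To make the trade-off concrete, a small helper can report how large each chunk is and how many chunks (objects or files) a candidate layout produces. This is a sketch only: the function name chunk_summary, the array shape, and both candidate layouts are assumptions chosen for illustration, and sizes are uncompressed.

```python
from math import ceil, prod


def chunk_summary(shape, chunks, itemsize):
    """Return (uncompressed bytes per chunk, number of chunks) for one array."""
    chunk_bytes = prod(chunks) * itemsize
    n_chunks = prod(ceil(n / c) for n, c in zip(shape, chunks))
    return chunk_bytes, n_chunks


# Hypothetical float32 field: one year of hourly data on a 2048 x 2048 grid
shape = (8760, 2048, 2048)
for chunks in [(1, 512, 512), (24, 1024, 1024)]:
    chunk_bytes, n_chunks = chunk_summary(shape, chunks, itemsize=4)
    print(chunks, f"{chunk_bytes / 2**20:.0f} MiB per chunk,", n_chunks, "chunks")
```

The first layout yields many 1 MiB chunks, the second far fewer 96 MiB chunks; which one reads faster depends on the storage backend and the access pattern.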
Object-store friendly I/O: read only what you need via fsspec, and align access windows with chunk layout
For Zarr or remote datasets, I/O performance is dominated by how your slice pattern matches chunking. If you typically read “time windows + spatial tiles,” design chunking (and your selections) accordingly.
code:python
import xarray as xr
ds = xr.open_zarr(
"s3://bucket/model.zarr",
chunks={"time": 24, "y": 1024, "x": 1024},
consolidated=True,
storage_options={"anon": False}, # adjust to your auth
)
# Example: time window + spatial tile (often best when aligned to chunk boundaries)
sub = (ds["ssh"]
       .sel(time=slice("2015-01-01", "2015-01-31"))
       .isel(y=slice(0, 2048), x=slice(0, 2048)))
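Why alignment matters can be estimated without touching the store: a chunked reader fetches whole chunks, so a window that straddles chunk boundaries pulls in data it never uses. The sketch below computes that read amplification for index ranges. The helper names are hypothetical, and it ignores partial chunks at the array edge.

```python
def chunks_touched(start, stop, chunk):
    """Number of chunks along one axis that the half-open range [start, stop) overlaps."""
    first = start // chunk
    last = (stop - 1) // chunk
    return last - first + 1


def read_amplification(selection, chunks):
    """Ratio of elements fetched (whole chunks) to elements actually requested."""
    fetched = 1
    requested = 1
    for (start, stop), chunk in zip(selection, chunks):
        fetched *= chunks_touched(start, stop, chunk) * chunk
        requested *= stop - start
    return fetched / requested


chunks = (24, 1024, 1024)  # (time, y, x), as in the open_zarr call above
aligned = [(0, 24), (0, 2048), (0, 2048)]
shifted = [(0, 24), (512, 2560), (512, 2560)]  # same size, offset by half a chunk
print(read_amplification(aligned, chunks))  # 1.0
print(read_amplification(shifted, chunks))  # 2.25
```

Shifting the same 2048 x 2048 tile by half a chunk in y and x more than doubles the bytes transferred.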
hr.icon
Writing
Chunk-shape “tiling” for multi-variable workflows
When multiple variables are used together (e.g., u, v, T, S), misaligned chunks cause repeated rechunking.
Solution: enforce a shared chunk template at write time.
code:python
# Assumes every data variable is 3-D with dims (time, y, x); adjust the tuple otherwise
encoding = {var: {"chunks": (24, 512, 512)} for var in ds.data_vars}
ds.to_zarr("ocean_aligned.zarr", encoding=encoding)
This avoids hidden rechunk costs during multi-variable operations like fluxes or budgets.
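When variables do not all share the same dimensions (say, 4-D velocities next to a 3-D sea surface height), a fixed chunk tuple no longer fits every variable. One possible approach, sketched below, is to derive each variable's chunk tuple from a per-dimension template. The function name encoding_from_template and the fallback of chunk length 1 for unlisted dimensions are assumptions of this sketch.

```python
def encoding_from_template(var_dims, template):
    """Build a to_zarr-style encoding dict from a per-dimension chunk template.

    var_dims maps variable name -> tuple of dimension names.
    template maps dimension name -> chunk length; dimensions missing from
    the template get chunk length 1 (an arbitrary choice for this sketch).
    """
    return {
        name: {"chunks": tuple(template.get(dim, 1) for dim in dims)}
        for name, dims in var_dims.items()
    }


# Hypothetical variables: 4-D fields plus a 3-D sea surface height
var_dims = {
    "u": ("time", "depth", "y", "x"),
    "T": ("time", "depth", "y", "x"),
    "ssh": ("time", "y", "x"),
}
template = {"time": 24, "y": 512, "x": 512}
encoding = encoding_from_template(var_dims, template)
print(encoding["u"])    # {'chunks': (24, 1, 512, 512)}
print(encoding["ssh"])  # {'chunks': (24, 512, 512)}
```

With an xarray Dataset, var_dims could be built as {name: ds[name].dims for name in ds.data_vars}.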
Write-time compression tuning (Blosc/Zstd tradeoffs)
Compression is not just about size—it affects read speed.
code:python
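from numcodecs import Blosc # the class that Zarr-Python 2 re-exports as zarr.Blosc
encoding = {
    "thetao": {
        # "compressor" is the Zarr format 2 key; format 3 stores take
        # "compressors" with zarr.codecs objects instead
        "compressor": Blosc(cname="zstd", clevel=3, shuffle=Blosc.BITSHUFFLE) # BITSHUFFLE == 2
    }
}
ds.to_zarr("optimized.zarr", encoding=encoding)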
Typical pattern:
moderate compression (clevel ~3–5) → best throughput
very high compression → slower reads, often not worth it
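The effect of the shuffle filter and the compression level can be explored with the standard library alone. The sketch below uses zlib as a stand-in for Blosc/Zstd (a different codec, so absolute numbers will not match) and a hand-written byte shuffle, which corresponds to Blosc's shuffle=1 rather than the bit shuffle (shuffle=2) used above. The synthetic field is an assumption chosen to be smooth, like many ocean variables.

```python
import struct
import zlib


def byte_shuffle(raw, itemsize):
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(raw[i::itemsize] for i in range(itemsize))


def byte_unshuffle(shuffled, itemsize):
    """Invert byte_shuffle."""
    n = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for i in range(itemsize):
        out[i::itemsize] = shuffled[i * n:(i + 1) * n]
    return bytes(out)


# Hypothetical smooth field: 4096 float32 values rising slowly from 20.0
values = [20.0 + 0.001 * i for i in range(4096)]
raw = struct.pack(f"<{len(values)}f", *values)
shuffled = byte_shuffle(raw, itemsize=4)

for level in (1, 6, 9):
    plain = len(zlib.compress(raw, level))
    shuf = len(zlib.compress(shuffled, level))
    print(f"level {level}: {plain} bytes plain, {shuf} bytes shuffled")
```

Timing the compress and decompress calls on a representative chunk of your own data is the more reliable way to pick a level.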
Use DataTree for multi-resolution / multi-product ocean archives
A useful emerging pattern is to represent related ocean products as a hierarchy rather than as many loosely connected Datasets: e.g., /raw_sst, /daily_sst, /fronts, /eddies, /climatology, each with its own grid and metadata. Xarray now has first-class DataTree support, and DataTree.chunk() can rechunk arrays across groups; DataTree.to_zarr() / open_datatree() make this natural for hierarchical Zarr stores. This is especially attractive for model–observation matchup archives, nested models, SWOT swath + gridded products, or glider profiles grouped by deployment.
code:python
import xarray as xr
tree = xr.DataTree.from_dict({
"/model/hourly": model_ds,
"/obs/argo": argo_ds,
"/diagnostics/mld": mld_ds,
})
tree = tree.chunk({"time": 30, "lat": 256, "lon": 256})
tree.to_zarr("ocean_matchup_hierarchy.zarr", mode="w")
A practical analytical workflow is to keep raw and derived diagnostics in the same store, but isolate their coordinates and chunking by group. That avoids forcing, say, profile data and gridded SSH onto a single artificial schema.
hr.icon
Appending/Updating
Operational Zarr workflows: append and partial updates (append_dim / region)
Instead of rewriting entire archives, design for incremental updates (daily runs, patches, reprocessed tiles). This can drastically reduce I/O.
code:python
# Initial write
ds0.to_zarr("ssh.zarr", mode="w", consolidated=True)
# Daily append along time
ds_new.to_zarr("ssh.zarr", mode="a", append_dim="time", consolidated=True)
# Patch a subregion (overwrite a spatial tile for a specific time slice).
# ti is the integer index of the time step; ds_patch must have exactly the region's shape.
region = {"time": slice(ti, ti + 1), "y": slice(2000, 2600), "x": slice(3000, 3600)}
ds_patch.to_zarr("ssh.zarr", mode="r+", region=region)
Tip: align patch regions with Zarr chunk boundaries whenever possible; misalignment can cause extra reads/writes.
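A quick check before writing can tell you whether a patch region sits on chunk boundaries, and which enclosing aligned region you would need otherwise. The sketch below uses the y and x ranges from the patch example above; the chunk length of 512, the axis length of 4096, and the helper names are assumptions for illustration.

```python
def is_aligned(start, stop, chunk, size):
    """True if [start, stop) starts on a chunk boundary and ends on one (or at the array edge)."""
    return start % chunk == 0 and (stop % chunk == 0 or stop == size)


def snap_outward(start, stop, chunk, size):
    """Smallest chunk-aligned range that contains [start, stop)."""
    new_start = (start // chunk) * chunk
    new_stop = min(-(-stop // chunk) * chunk, size)  # ceiling division, clipped to the axis
    return new_start, new_stop


# Hypothetical layout: y and x axes of length 4096 stored in chunks of 512
print(is_aligned(2000, 2600, chunk=512, size=4096))    # False
print(snap_outward(2000, 2600, chunk=512, size=4096))  # (1536, 3072)
print(snap_outward(3000, 3600, chunk=512, size=4096))  # (2560, 4096)
```

Writing the snapped region means the patch dataset has to cover the extra cells too, typically by reading the existing values for them first.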
Use region writes for streaming model output or daily satellite updates
For operational or semi-operational ocean workflows, append-like writes can be slow and fragile. Xarray’s Dataset.to_zarr(region=...) supports writing into pre-existing Zarr arrays, but the documentation warns that region boundaries, Zarr chunks, and Dask chunks must align; otherwise incomplete chunk writes can corrupt data.
code:python
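import pandas as pd
import xarray as xr
# Template for the whole year; assumes ds_day0 has dims (lat, lon) only.
# 2026 is not a leap year, so the daily axis has 365 steps.
# Chunking with Dask lets compute=False write metadata without the zero-filled arrays.
times = pd.date_range("2026-01-01", periods=365)
template = xr.zeros_like(ds_day0).expand_dims(time=times).chunk({"time": 1})
template.to_zarr(
    "sst_daily_2026.zarr",
    mode="w",
    compute=False,
    encoding={"sst": {"chunks": (1, 720, 1440)}},
)
# Later: write one day at a time, aligned to one time chunk.
# lat/lon are dropped because region writes reject variables that lack a region dimension.
daily_ds.drop_vars(["lat", "lon"]).chunk({"time": 1, "lat": 720, "lon": 1440}).to_zarr(
    "sst_daily_2026.zarr",
    region={"time": slice(day_index, day_index + 1)}
)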
This is useful for building rolling marine heatwave, SST-front, sea-ice-edge, or altimetry anomaly archives without rewriting the full store.
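The day_index used in the region write above has to come from somewhere. One simple option, assuming a gap-free daily time axis that starts on 2026-01-01 as in the template, is to count days from the start date. The helper names below are hypothetical.

```python
from datetime import date


def day_index(day, start=date(2026, 1, 1)):
    """Zero-based position of `day` on a daily time axis that begins at `start`."""
    return (day - start).days


def time_region(day, start=date(2026, 1, 1)):
    """Region dict selecting the single time step for `day`."""
    i = day_index(day, start)
    return {"time": slice(i, i + 1)}


print(day_index(date(2026, 1, 1)))    # 0
print(time_region(date(2026, 3, 1)))  # {'time': slice(59, 60, None)}
```

If the time axis can have gaps or a different cadence, looking the timestamp up in the store's own time coordinate is the safer route.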
#zarr
#xarray